library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(Ecdat)
## Loading required package: Ecfun
##
## Attaching package: 'Ecfun'
##
## The following object is masked from 'package:base':
##
## sign
##
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(FNN)
library(ggrepel)
library(e1071)
library(gains)
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(xgboost)
##
## Attaching package: 'xgboost'
##
## The following object is masked from 'package:dplyr':
##
## slice
library(ROCR)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:randomForest':
##
## combine
##
## The following object is masked from 'package:dplyr':
##
## combine
library(Hmisc)
##
## Attaching package: 'Hmisc'
##
## The following object is masked from 'package:e1071':
##
## impute
##
## The following objects are masked from 'package:dplyr':
##
## src, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, units
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
##
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
##
## The following object is masked from 'package:base':
##
## Recall
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Set your directory. Select your own folder to load the data.
setwd("~/Library/CloudStorage/OneDrive-Personal/Boston University/MET AD699 Data Mining for Business Analytics/Final project")
train_csv <- read_csv("train.csv.csv")
## Rows: 74111 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): property_type, room_type, amenities, bed_type, cancellation_polic...
## dbl (10): id, log_price, accommodates, bathrooms, latitude, longitude, numb...
## lgl (4): cleaning_fee, host_has_profile_pic, host_identity_verified, insta...
## date (3): first_review, host_since, last_review
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter for NYC
df <- read.csv("train.csv.csv")
nyc_data <- df %>%
filter(city == "NYC")
# Handling NA values: recode empty strings as NA first, then count missing values per column
nyc_data[nyc_data == ""] <- NA
missing_values <- colSums(is.na(nyc_data))
head(nyc_data)
## id log_price property_type room_type
## 1 6901257 5.010635 Apartment Entire home/apt
## 2 6304928 5.129899 Apartment Entire home/apt
## 3 7919400 4.976734 Apartment Entire home/apt
## 4 5578513 4.605170 Apartment Private room
## 5 17589436 4.882802 Apartment Entire home/apt
## 6 18224863 4.595120 House Entire home/apt
## amenities
## 1 {"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Essentials,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"}
## 2 {"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"}
## 3 {TV,"Cable TV","Wireless Internet","Air conditioning",Kitchen,Breakfast,"Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_50"}
## 4 {Internet,"Wireless Internet","Air conditioning",Kitchen,"Pets allowed","Buzzer/wireless intercom",Heating,Washer,Dryer,"Smoke detector",Essentials,Hangers,"Hair dryer","Laptop friendly workspace","translation missing: en.hosting_amenity_50"}
## 5 {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating,"Smoke detector","Carbon monoxide detector",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace"}
## 6 {TV,"Wireless Internet","Air conditioning",Kitchen,Breakfast,Heating,"Smoke detector","Fire extinguisher",Essentials,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","Private entrance","Hot water","Bed linens","Extra pillows and blankets",Microwave,Refrigerator,"Dishes and silverware","Cooking basics",Oven,Stove}
## accommodates bathrooms bed_type cancellation_policy cleaning_fee city
## 1 3 1 Real Bed strict True NYC
## 2 7 1 Real Bed strict True NYC
## 3 5 1 Real Bed moderate True NYC
## 4 2 1 Real Bed strict True NYC
## 5 2 1 Real Bed strict True NYC
## 6 8 1 Real Bed strict True NYC
## description
## 1 Beautiful, sunlit brownstone 1-bedroom in the loveliest neighborhood in Brooklyn. Blocks from the promenade and Brooklyn Bridge Park, with their stunning views of Manhattan, and from the great shopping and food.
## 2 Enjoy travelling during your stay in Manhattan. My place is centrally located near Times Square and Central Park with easy access to main subways as well as walking distance to many popular restaurants and bus tours. My place is close to the subway, Totto Ramen, Hell's Kitchen, Ippudo Westside, Empanada Mama, Intrepid Sea, Air & Space Museum. My place has three true bedrooms and one bathroom. The kitchen is stocked with stainless steel appliances like the Keurig machine. The living room is spacious and can accommodate another person thanks to the pull out bed. My place is centrally located to some of the top attractions in the city. Feel free to explore the entire apartment and do not worry about sharing the space with any strangers. This is all yours during your stay. I am available via text/email/phone for anything you might need. - Times Square - Rockefeller Plaza - Central Park - 5th Avenue Shopping -Broadway Theater District - Empire State Building - Hudson River Express Subwa
## 3 The Oasis comes complete with a full backyard with outdoor furniture to make the most of this summer vacation!! The unit has high ceilings, a completed renovation throughout, beautiful flood lighting and exposed brick! Best part, total seclusion. You share with no one! The entire unit is yours during your stay. It's a fully furnished apartment that can hold up to 5 people. The only items you need are a toothbrush and your luggage!!! The unit has high ceilings, a completed renovation throughout, beautiful flood lighting and exposed brick! Not to mention the large backyard complete with ourdoor furniture. Best part, total seclusion. You share with no one! The entire unit is yours during your stay. The entire unit and backyard Garden area My assistant is available off site via phone. Other than that, you will be alone for your stay. The neighborhood of central Harlem is very diverse and full of culture! We are a few blocks from Historic 125th Street, The Apollo Theatre, The Schomburg
## 4 This is a bright bedroom in an historic building with exposed brick walls and modern appointments, situated in the most vibrant and eclectic neighborhood in Manhattan! You'll get an authentic experience during your stay in New York here. I have a room available in a lovely apartment in the East Village, close to the subway (5-10 minutes) and SoHo, Little Italy, and the Lower East Side (Each about 15 minutes). Also located nearby are the eclectic and Lively St. Marks Place and Tompkins Square Park. The room has a full-size bed, a large wardrobe, bookshelves, and a closet, and it has two large windows with plenty of light. The apartment has a fully-equipped kitchen, a living room, a bathroom, a washer, and a dryer, all of which guests are welcome to use. The East Village is a great neighborhood with ample nightlife, and borders many other well-known Manhattan neighborhoods each with unique restaurants, shops, and character. It is minutes away from the 2nd Avenue station for the F
## 5 I am renting a very spacious, unique, bright L shape studio filled with character for perfect NYC vibe & getaway. Apt is located Lower East Side (LES) walking distant SOHO,Nolita, Chinatown, lots of restaurants and bars and close to the F,B,D,J,M,Z . Hello all , I am happy to rent my apartment to respectful, clean individuals. I love my apartment and want to make sure it is in good hands while away. My apt. is located one of the most convenient and attractive neighborhood in Manhattan, Lower East Side (LES). It's a very quiet apartment best fits 2 guests. ( one Queen size real bed 60" x 80" and one single day bed) 15 feet wide window provides great light and eastern views of downtown Manhattan, lying in bed and gazing out the window is lovely. Furniture's are carefully selected by award winner interior designer and items within the space give the whole apartment a colorful , light and airy / spacious feeling. Any questions I am happy to answer and after 14+ years living in New Yor
## 6 Huge 3 bedroom apartment situated in Flatbush, Brooklyn. Average 8 mins walk to train station. Average 30 mins to Manhattan and 10 mins ride to Barclays Center, Prospect Park, Brooklyn Museum, Kings Theatre and Williamsburg.This apartment minimun booking amount are three guests. Spacious 3 bedroom with lots of closet space. Private entrance Full kitchen Large living room Large bedrooms Wifi Queen size bed Private entrance Yes Multicultural neighborhood, shopping areas as well as restaurants that are in close proximity. Shopping and fitness centers within walking distance. Average 8 mins walk to train station and 2 mins to the bus. Average 30 mins to Manhattan and 10 mins ride to Barclays Center, Prospect Park and Brooklyn Museum. 10 mins walk to Kings Theatre. Bus or train to Williamsburg.
## first_review host_has_profile_pic host_identity_verified host_response_rate
## 1 2016-06-18 t t <NA>
## 2 2017-08-05 t f 100%
## 3 2017-04-30 t t 100%
## 4 2013-04-28 t t 100%
## 5 2015-09-26 t t 71%
## 6 2017-07-30 t f 100%
## host_since instant_bookable last_review latitude longitude
## 1 2012-03-26 f 2016-07-18 40.69652 -73.99162
## 2 2017-06-19 t 2017-09-23 40.76612 -73.98904
## 3 2016-10-25 t 2017-09-14 40.80811 -73.94376
## 4 2013-03-27 f 2016-11-15 40.72388 -73.98388
## 5 2015-08-17 f 2016-12-12 40.71909 -73.99028
## 6 2017-05-09 t 2017-09-15 40.64377 -73.95085
## name neighbourhood
## 1 Beautiful brownstone 1-bedroom Brooklyn Heights
## 2 Superb 3BR Apt Located Near Times Square Hell's Kitchen
## 3 The Garden Oasis Harlem
## 4 Large East Village Bedroom To Let! Alphabet City
## 5 Amazing LES apt - cool, bright... Lower East Side
## 6 3 Bedroom Apartment for Small Group or Family Flatbush
## number_of_reviews review_scores_rating
## 1 2 100
## 2 6 93
## 3 10 92
## 4 82 93
## 5 26 86
## 6 5 72
## thumbnail_url
## 1 https://a0.muscache.com/im/pictures/6d7cbbf7-c034-459c-bc82-6522c957627c.jpg?aki_policy=small
## 2 https://a0.muscache.com/im/pictures/348a55fe-4b65-452a-b48a-bfecb3b58a66.jpg?aki_policy=small
## 3 https://a0.muscache.com/im/pictures/6fae5362-9e3a-4fa9-aa54-bbd5ea26538d.jpg?aki_policy=small
## 4 https://a0.muscache.com/im/pictures/21726900/19c843e1_original.jpg?aki_policy=small
## 5 <NA>
## 6 https://a0.muscache.com/im/pictures/e9baba99-e67a-413c-9986-9386f8c5fa2f.jpg?aki_policy=small
## zipcode bedrooms beds
## 1 11201 1 1
## 2 10019 3 3
## 3 10027 1 3
## 4 10009.0 1 1
## 5 10002 1 2
## 6 11226.0 3 3
We chose NYC for our case study and, as a first step, filtered the dataset on the city column. We then began data preparation by checking for missing values. Since blank cells are not automatically recognized as missing (NA) in R, we converted all empty strings to NA using the command nyc_data[nyc_data == ""] <- NA. This step ensures that R can correctly detect and handle missing data during analysis. We then used colSums(is.na(nyc_data)) to count the number of missing values in each column. Addressing missing data is essential to avoid bias or errors in our analysis and to maintain the reliability of any statistical models we build from the dataset.
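The distinction between empty strings and NA matters in practice. A minimal sketch on a toy data frame (not our dataset) shows that is.na() sees nothing until the blanks are recoded:

```r
# Toy example: blank cells read from a CSV arrive as "" (empty strings), not NA
df_demo <- data.frame(zipcode = c("11201", "", "10027"),
                      stringsAsFactors = FALSE)
before <- sum(is.na(df_demo$zipcode))   # 0: "" is not missing to R
df_demo[df_demo == ""] <- NA            # recode empty strings as NA
after <- sum(is.na(df_demo$zipcode))    # 1: now detectable with is.na()
```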
# II. Summary Statistics
# Compute summary statistics
summary_stats <- nyc_data %>% summarise(
mean_value = mean(log_price, na.rm = TRUE),
median_value = median(log_price, na.rm = TRUE),
sd_value = sd(log_price, na.rm = TRUE),
min_value = min(log_price, na.rm = TRUE),
max_value = max(log_price, na.rm = TRUE)
)
print(summary_stats)
## mean_value median_value sd_value min_value max_value
## 1 4.71934 4.65396 0.6615672 0 7.600402
The summary statistics for log_price in the NYC dataset reveal key characteristics of the price distribution. The mean log price is 4.719, while the median is 4.653, indicating that the distribution is slightly right-skewed, as the mean is slightly higher than the median. The standard deviation (0.66) suggests moderate variability in property prices.
The minimum log price is 0, which corresponds to a raw price of exp(0) = $1; this is unusual and may indicate missing or incorrect data entries that need further investigation. At the other end, the maximum log price is 7.60, the upper bound of property prices in the dataset.
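Those suspicious rows can be isolated before modeling. A minimal sketch of such a check, shown here on a hypothetical toy data frame rather than nyc_data:

```r
# Toy illustration: flag listings whose log price is exactly zero
toy <- data.frame(id = 1:4, log_price = c(4.7, 0, 5.1, 0))
suspect <- subset(toy, log_price == 0)
n_suspect <- nrow(suspect)   # 2 suspicious rows in this toy example
exp(0)                       # 1: a $1 nightly price, almost certainly a data error
```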
# III. Visualization
str(nyc_data)
## 'data.frame': 32349 obs. of 29 variables:
## $ id : int 6901257 6304928 7919400 5578513 17589436 18224863 16679342 14122244 14490881 4734170 ...
## $ log_price : num 5.01 5.13 4.98 4.61 4.88 ...
## $ property_type : chr "Apartment" "Apartment" "Apartment" "Apartment" ...
## $ room_type : chr "Entire home/apt" "Entire home/apt" "Entire home/apt" "Private room" ...
## $ amenities : chr "{\"Wireless Internet\",\"Air conditioning\",Kitchen,Heating,\"Family/kid friendly\",Essentials,\"Hair dryer\",I"| __truncated__ "{\"Wireless Internet\",\"Air conditioning\",Kitchen,Heating,\"Family/kid friendly\",Washer,Dryer,\"Smoke detect"| __truncated__ "{TV,\"Cable TV\",\"Wireless Internet\",\"Air conditioning\",Kitchen,Breakfast,\"Buzzer/wireless intercom\",Heat"| __truncated__ "{Internet,\"Wireless Internet\",\"Air conditioning\",Kitchen,\"Pets allowed\",\"Buzzer/wireless intercom\",Heat"| __truncated__ ...
## $ accommodates : int 3 7 5 2 2 8 2 6 4 3 ...
## $ bathrooms : num 1 1 1 1 1 1 1.5 1 1 1 ...
## $ bed_type : chr "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
## $ cancellation_policy : chr "strict" "strict" "moderate" "strict" ...
## $ cleaning_fee : chr "True" "True" "True" "True" ...
## $ city : chr "NYC" "NYC" "NYC" "NYC" ...
## $ description : chr "Beautiful, sunlit brownstone 1-bedroom in the loveliest neighborhood in Brooklyn. Blocks from the promenade and"| __truncated__ "Enjoy travelling during your stay in Manhattan. My place is centrally located near Times Square and Central Par"| __truncated__ "The Oasis comes complete with a full backyard with outdoor furniture to make the most of this summer vacation!!"| __truncated__ "This is a bright bedroom in an historic building with exposed brick walls and modern appointments, situated in "| __truncated__ ...
## $ first_review : chr "2016-06-18" "2017-08-05" "2017-04-30" "2013-04-28" ...
## $ host_has_profile_pic : chr "t" "t" "t" "t" ...
## $ host_identity_verified: chr "t" "f" "t" "t" ...
## $ host_response_rate : chr NA "100%" "100%" "100%" ...
## $ host_since : chr "2012-03-26" "2017-06-19" "2016-10-25" "2013-03-27" ...
## $ instant_bookable : chr "f" "t" "t" "f" ...
## $ last_review : chr "2016-07-18" "2017-09-23" "2017-09-14" "2016-11-15" ...
## $ latitude : num 40.7 40.8 40.8 40.7 40.7 ...
## $ longitude : num -74 -74 -73.9 -74 -74 ...
## $ name : chr "Beautiful brownstone 1-bedroom" "Superb 3BR Apt Located Near Times Square" "The Garden Oasis" "Large East Village Bedroom To Let!" ...
## $ neighbourhood : chr "Brooklyn Heights" "Hell's Kitchen" "Harlem" "Alphabet City" ...
## $ number_of_reviews : int 2 6 10 82 26 5 57 14 4 40 ...
## $ review_scores_rating : num 100 93 92 93 86 72 89 100 100 88 ...
## $ thumbnail_url : chr "https://a0.muscache.com/im/pictures/6d7cbbf7-c034-459c-bc82-6522c957627c.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/348a55fe-4b65-452a-b48a-bfecb3b58a66.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/6fae5362-9e3a-4fa9-aa54-bbd5ea26538d.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/21726900/19c843e1_original.jpg?aki_policy=small" ...
## $ zipcode : chr "11201" "10019" "10027" "10009.0" ...
## $ bedrooms : num 1 3 1 1 1 3 1 3 2 1 ...
## $ beds : num 1 3 3 1 2 3 1 3 2 2 ...
# 1. Histogram of log_price
ggplot(nyc_data, aes(x = log_price)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
labs(title = "Distribution of Log Price", x = "Log Price", y = "Count")
# 2. Boxplot of log_price
ggplot(nyc_data, aes(y = log_price)) +
geom_boxplot(fill = "red", alpha = 0.5) +
labs(title = "Boxplot of Log Price", y = "Log Price")
# 3. Scatter plot of log_price vs No. of Bedroom
# Remove rows with missing or non-finite values before plotting
nyc_clean <- nyc_data %>%
filter(!is.na(bedrooms), !is.na(log_price), is.finite(log_price))
ggplot(nyc_clean, aes(x = bedrooms, y = log_price)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Log Price vs Bedrooms", x = "Bedrooms", y = "Log Price")
## `geom_smooth()` using formula = 'y ~ x'
# 4. Point plot log price by scores rating
# Clean the data for this specific plot since some data were missing
nyc_clean2 <- nyc_data %>%
filter(!is.na(log_price), !is.na(review_scores_rating),
is.finite(log_price), is.finite(review_scores_rating))
ggplot(nyc_clean2, aes(x = log_price, y = review_scores_rating)) +
geom_point(size = 1, alpha = 0.5) +
labs(title = "Log Price vs Review Scores Rating",
x = "Log Price",
y = "Review Scores Rating")
# 5. Density plot of log_price
ggplot(nyc_data, aes(x = log_price)) +
geom_density(fill = "green", alpha = 0.5) +
labs(title = "Density Plot of Log Price", x = "Log Price", y = "Density")
The five visualizations illustrate different perspectives on the distribution of property prices (log_price) and its relationships with other variables. (1) The histogram presents the frequency distribution of log_price, showing a roughly normal distribution with a peak around 4-5. (2) The boxplot highlights the presence of outliers in both the lower and upper ranges, which could indicate extreme property values that may need further investigation.
(3) The scatter plot of log price against the number of bedrooms suggests a positive correlation between the two variables, as expected in real estate pricing. The linear trend line reinforces this relationship, indicating that as the number of bedrooms increases, property prices generally rise. However, the plot also reveals variability and outliers, meaning additional factors may influence price beyond bedroom count. (4) The scatterplot of log price against review score shows a massive concentration of points at review_scores_rating = 100, especially in the log_price range of 4.0 to 5.5, and a cloud of points without a clear trend. This suggests no strong linear relationship between price and rating: customers tend to rate highly whether the price is low, medium, or high, so review score is not a strong discriminator of price. (5) The density plot provides a smoothed view of the same roughly bell-shaped distribution seen in the histogram.
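Since gridExtra is already loaded, related views such as the histogram and density plot can be combined into one figure. A sketch using simulated data with roughly the observed mean and standard deviation, standing in for nyc_data:

```r
library(ggplot2)
library(gridExtra)

# Simulated log prices approximating the observed mean (4.72) and sd (0.66)
set.seed(42)
toy <- data.frame(log_price = rnorm(500, mean = 4.72, sd = 0.66))
p1 <- ggplot(toy, aes(x = log_price)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
  labs(title = "Histogram", x = "Log Price", y = "Count")
p2 <- ggplot(toy, aes(x = log_price)) +
  geom_density(fill = "green", alpha = 0.5) +
  labs(title = "Density", x = "Log Price", y = "Density")
grid.arrange(p1, p2, ncol = 2)   # two panels side by side in one figure
```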
# Prepare the regression data by selecting relevant variables and handling missing values
regression_data <- dplyr::select(nyc_data,
log_price, bedrooms, bathrooms, number_of_reviews, review_scores_rating, accommodates,
room_type, cancellation_policy, host_response_rate, instant_bookable, latitude, longitude) %>%
drop_na()
# Impute missing values for continuous variables with the median
# (note: drop_na() above already removed all rows with NA, so these steps are defensive no-ops here)
regression_data$bedrooms <- ifelse(is.na(regression_data$bedrooms),
median(regression_data$bedrooms, na.rm = TRUE),
regression_data$bedrooms)
regression_data$bathrooms <- ifelse(is.na(regression_data$bathrooms),
median(regression_data$bathrooms, na.rm = TRUE),
regression_data$bathrooms)
regression_data$review_scores_rating <- ifelse(is.na(regression_data$review_scores_rating),
median(regression_data$review_scores_rating, na.rm = TRUE),
regression_data$review_scores_rating)
regression_data$host_response_rate <- ifelse(is.na(regression_data$host_response_rate),
median(regression_data$host_response_rate, na.rm = TRUE),
regression_data$host_response_rate)
# Impute missing values for categorical variables with mode
mode_function <- function(x) {
uniq_x <- unique(x)
uniq_x[which.max(tabulate(match(x, uniq_x)))]
}
regression_data$room_type <- ifelse(is.na(regression_data$room_type),
mode_function(regression_data$room_type),
regression_data$room_type)
regression_data$cancellation_policy <- ifelse(is.na(regression_data$cancellation_policy),
mode_function(regression_data$cancellation_policy),
regression_data$cancellation_policy)
regression_data$instant_bookable <- ifelse(is.na(regression_data$instant_bookable),
mode_function(regression_data$instant_bookable),
regression_data$instant_bookable)
# Convert categorical variables to factors
regression_data$room_type <- as.factor(regression_data$room_type)
regression_data$cancellation_policy <- as.factor(regression_data$cancellation_policy)
regression_data$instant_bookable <- as.factor(regression_data$instant_bookable)
To make a model that predicts the log price of Airbnb listings, we first looked at all the columns in the dataset. We picked the ones that we thought could affect the price, such as how many bedrooms and bathrooms the listing has, how many guests it accommodates, how many reviews it has, and its review score. We also added the location (latitude and longitude), the type of room (such as entire home or shared room), and the cancellation policy. We removed things that wouldn't help, like the listing's name, description, or ID number, since those don't really affect price. We converted the room type and cancellation policy into factor variables that the model could understand.
After that, we ran the regression model and looked at each variable's p-value. If the p-value was higher than 0.05, the variable probably doesn't affect the price much, so we removed it. For example, cancellation_policymoderate had a high p-value, so we didn't keep it. We also checked that our variables weren't too similar to each other (multicollinearity) using VIF.
To check how good our model is, we looked at two things:
R-squared tells us how much of the variation in price the model can explain.
RMSE tells us how far off our model's predictions are from the actual prices.
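The VIF check can be done without extra packages by regressing each predictor on the others: VIF = 1 / (1 - R^2), and values above 5 (some use 10) are a common rule of thumb for problematic collinearity. A sketch on simulated predictors, not our regression_data:

```r
# Manual VIF: regress each predictor on the rest; VIF = 1 / (1 - R^2)
vif_one <- function(df, var) {
  others <- setdiff(names(df), var)
  r2 <- summary(lm(reformulate(others, response = var), data = df))$r.squared
  1 / (1 - r2)
}

set.seed(1)
toy <- data.frame(bedrooms = rpois(200, 2), bathrooms = rpois(200, 1))
toy$accommodates <- 2 * toy$bedrooms + rpois(200, 1)  # deliberately collinear
vifs <- sapply(names(toy), function(v) vif_one(toy, v))
round(vifs, 2)   # bedrooms and accommodates show inflated VIFs; bathrooms does not
```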
# Fit the linear regression model
model <- lm(log_price ~ bedrooms + bathrooms + number_of_reviews +
review_scores_rating + accommodates +
room_type + cancellation_policy +
latitude + longitude,
data = regression_data)
# View the regression summary
options(scipen = 999)
summary(model)
##
## Call:
## lm(formula = log_price ~ bedrooms + bathrooms + number_of_reviews +
## review_scores_rating + accommodates + room_type + cancellation_policy +
## latitude + longitude, data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6191 -0.2382 -0.0083 0.2224 2.4731
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -345.45396394 5.32716608 -64.848
## bedrooms 0.09921290 0.00524634 18.911
## bathrooms 0.11289394 0.00753276 14.987
## number_of_reviews -0.00031425 0.00007381 -4.257
## review_scores_rating 0.00473564 0.00037595 12.596
## accommodates 0.07784139 0.00224690 34.644
## room_typePrivate room -0.60274721 0.00683231 -88.220
## room_typeShared room -0.99729137 0.01832176 -54.432
## cancellation_policymoderate -0.00623078 0.00848373 -0.734
## cancellation_policystrict 0.02617924 0.00771745 3.392
## cancellation_policysuper_strict_30 0.52560909 0.22382877 2.348
## cancellation_policysuper_strict_60 1.65563878 0.38742121 4.273
## latitude 1.50581934 0.05155542 29.208
## longitude -3.89728759 0.06393523 -60.957
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## bedrooms < 0.0000000000000002 ***
## bathrooms < 0.0000000000000002 ***
## number_of_reviews 0.0000208 ***
## review_scores_rating < 0.0000000000000002 ***
## accommodates < 0.0000000000000002 ***
## room_typePrivate room < 0.0000000000000002 ***
## room_typeShared room < 0.0000000000000002 ***
## cancellation_policymoderate 0.462690
## cancellation_policystrict 0.000695 ***
## cancellation_policysuper_strict_30 0.018871 *
## cancellation_policysuper_strict_60 0.0000193 ***
## latitude < 0.0000000000000002 ***
## longitude < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3873 on 19401 degrees of freedom
## Multiple R-squared: 0.6454, Adjusted R-squared: 0.6452
## F-statistic: 2717 on 13 and 19401 DF, p-value: < 0.00000000000000022
# Calculate RMSE
predictions <- predict(model, newdata = regression_data)
rmse <- sqrt(mean((predictions - regression_data$log_price)^2))
cat("RMSE: ", rmse, "\n")
## RMSE: 0.3871815
# Evaluate R-squared
rsq <- summary(model)$r.squared
cat("R-squared: ", rsq, "\n")
## R-squared: 0.645433
Regression Formula:
log_price = -345.45
+ 0.0992 * bedrooms
+ 0.1129 * bathrooms
- 0.0003 * number_of_reviews
+ 0.0047 * review_scores_rating
+ 0.0778 * accommodates
- 0.6027 * room_typePrivate room
- 0.9973 * room_typeShared room
+ 0.0262 * cancellation_policystrict
+ 0.5256 * cancellation_policysuper_strict_30
+ 1.6556 * cancellation_policysuper_strict_60
+ 1.5058 * latitude
- 3.8973 * longitude
Each number (called a “coefficient”) shows how the price changes when one thing increases. For example:
More bedrooms = higher price
Private rooms and shared rooms = lower price than entire homes
Higher rating = higher price
Some cancellation policies = higher prices.
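Because the response is on the log scale, a coefficient b translates into a multiplicative effect of exp(b) on the raw price. A small sketch using the fitted estimates copied from the summary above:

```r
# Approximate percentage effect on raw price implied by a log-scale coefficient
pct_effect <- function(b) round((exp(b) - 1) * 100, 1)

pct_effect(0.0992)    # bedrooms: roughly +10.4% per additional bedroom
pct_effect(-0.6027)   # private room: roughly -45.3% vs. an entire home/apt
```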
The R-squared is 0.6454, which means our model explains about 64.5% of the variation in prices. That's pretty good: it means most of the important factors that affect price are in the model.
The RMSE is 0.3872, which means our model's predictions are off by about 0.39 (on the log scale) on average. The smaller the RMSE, the better, so this tells us the model is doing a decent job.
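Exponentiating puts the log-scale RMSE back in price terms (a rough heuristic, since RMSE averages squared errors):

```r
# A typical error of 0.3872 on the log scale corresponds to the prediction
# being off by a multiplicative factor of about exp(0.3872) on the raw price
rmse_log <- 0.3872
round(exp(rmse_log), 2)   # about 1.47, i.e. predictions off by roughly 47% either way
```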
# Load necessary libraries for KNN classification
library(caret)
library(class)
##
## Attaching package: 'class'
## The following objects are masked from 'package:FNN':
##
## knn, knn.cv
# Prepare the data for classification
nyc_data_clean <- dplyr::select(nyc_data, cleaning_fee, log_price, bedrooms, bathrooms, number_of_reviews,
review_scores_rating, accommodates, room_type, cancellation_policy,
host_response_rate, instant_bookable, latitude, longitude)
# Convert cleaning_fee: TRUE/FALSE to factor ("Yes"/"No")
nyc_data_clean$cleaning_fee <- ifelse(nyc_data_clean$cleaning_fee, "Yes", "No")
nyc_data_clean$cleaning_fee <- as.factor(nyc_data_clean$cleaning_fee)
# Select predictors for KNN classification
predictors <- c("bedrooms", "bathrooms", "number_of_reviews", "review_scores_rating",
"accommodates", "room_type", "latitude", "longitude")
# Prepare feature matrix and target vector
X <- nyc_data_clean[, predictors]
y <- nyc_data_clean$cleaning_fee
# Split data into training and testing sets
set.seed(123)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
train_X <- X[train_index, ]
train_Y <- y[train_index]
test_X <- X[-train_index, ]
test_Y <- y[-train_index]
# Impute missing numeric values using the median
# (note: for a stricter evaluation, the test set should use the training-set medians to avoid leakage)
numeric_cols <- c("bedrooms", "bathrooms", "review_scores_rating")
for (col in numeric_cols) {
train_X[[col]][is.na(train_X[[col]])] <- median(train_X[[col]], na.rm = TRUE)
test_X[[col]][is.na(test_X[[col]])] <- median(test_X[[col]], na.rm = TRUE)
}
# One-hot encode 'room_type'
train_X <- cbind(train_X[, -which(names(train_X) == "room_type")],
model.matrix(~ room_type - 1, data = train_X))
test_X <- cbind(test_X[, -which(names(test_X) == "room_type")],
model.matrix(~ room_type - 1, data = test_X))
# Scale numeric features
train_X_scaled <- scale(train_X)
test_X_scaled <- scale(test_X, center = attr(train_X_scaled, "scaled:center"),
scale = attr(train_X_scaled, "scaled:scale"))
# Try different k values and find the one with best accuracy
accuracy <- sapply(1:20, function(k) {
pred <- knn(train = train_X_scaled, test = test_X_scaled, cl = train_Y, k = k)
mean(pred == test_Y)
})
# Best k and accuracy
best_k <- which.max(accuracy)
cat("Best k:", best_k, "\n")
## Best k: 19
cat("Best Accuracy:", accuracy[best_k], "\n")
## Best Accuracy: 0.7073736
# Final KNN prediction with best_k
final_pred <- knn(train = train_X_scaled, test = test_X_scaled, cl = train_Y, k = best_k)
# Evaluate the model
conf_matrix <- confusionMatrix(final_pred, test_Y)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 479 480
## Yes 1413 4097
##
## Accuracy : 0.7074
## 95% CI : (0.6961, 0.7184)
## No Information Rate : 0.7075
## P-Value [Acc > NIR] : 0.5171
##
## Kappa : 0.1734
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.25317
## Specificity : 0.89513
## Pos Pred Value : 0.49948
## Neg Pred Value : 0.74356
## Prevalence : 0.29247
## Detection Rate : 0.07405
## Detection Prevalence : 0.14825
## Balanced Accuracy : 0.57415
##
## 'Positive' Class : No
##
In this part, we wanted to predict whether a listing has a cleaning fee or not (yes or no). To do this, we used a machine learning method called K-nearest neighbors (KNN). We first converted the cleaning fee column into a factor with the levels "Yes" (there is a cleaning fee) and "No" (there isn't). Then we selected variables like bedrooms, bathrooms, how many guests the listing can accommodate, number of reviews, review scores, and room type.
Since KNN works with numeric distances, we split the data into a training set (for learning) and a testing set (to check how well the model performs), converted the room type into numeric dummy variables (one-hot encoding), and scaled all features using the training set's centering and scaling parameters so everything was on the same scale.
We tested different values of k (the number of neighbors to consider) and found that k = 19 gave the best accuracy. This means the model looked at the 19 listings most similar to the one it was predicting and based its decision on their majority class. Finally, we checked the results using a confusion matrix, which showed how many predictions were correct or incorrect.
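The k sweep is easier to judge visually. A sketch of an accuracy-vs-k curve, using illustrative toy values in place of the accuracy vector computed by the sapply() loop above:

```r
library(ggplot2)

# Toy accuracy values standing in for the sapply() results over k = 1..10
acc <- c(0.662, 0.678, 0.688, 0.694, 0.698, 0.701, 0.703, 0.705, 0.706, 0.707)
sweep <- data.frame(k = seq_along(acc), accuracy = acc)
p <- ggplot(sweep, aes(x = k, y = accuracy)) +
  geom_line() +
  geom_point() +
  labs(title = "KNN accuracy by k (toy values)", x = "k", y = "Accuracy")
best_k_toy <- which.max(acc)   # highest-accuracy k in this toy sweep
```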
The host response rate variable was converted to numeric and then binned. We wanted to see whether reducing the number of levels for host response rate by binning it would make it a good predictor variable for the model. Cancellation policy was consolidated into three levels for the Naive Bayes and decision tree models.
# Remove rows containing NA values
nyc_data_clean <- nyc_data %>%
drop_na()
#Naive Bayes
nyc_naive.df <- nyc_data_clean
#str(nyc_naive.df)
#run this line of code to clear global environ except for orig df
#rm(list = setdiff(ls(), "train_csv"))
# Convert character variables to factors
# (ignoring city because the data has already been filtered to NYC)
nyc_naive.df$property_type<-as.factor(nyc_naive.df$property_type)
nyc_naive.df$room_type<-as.factor(nyc_naive.df$room_type)
nyc_naive.df$amenities<-as.factor(nyc_naive.df$amenities)
nyc_naive.df$bed_type<-as.factor(nyc_naive.df$bed_type)
nyc_naive.df$zipcode<-as.factor(nyc_naive.df$zipcode)
nyc_naive.df$thumbnail_url<-as.factor(nyc_naive.df$thumbnail_url)
nyc_naive.df$name<-as.factor(nyc_naive.df$name)
nyc_naive.df$neighbourhood<-as.factor(nyc_naive.df$neighbourhood)
#nyc_naive.df$host_response_rate<-as.factor(nyc_naive.df$host_response_rate)
nyc_naive.df$description<-as.factor(nyc_naive.df$description)
# Convert host_response_rate to numeric, then bin it into a factor
#summary(nyc_naive.df$host_response_rate)
nyc_naive.df$host_response_rate <- as.numeric(gsub("%", "", nyc_naive.df$host_response_rate))
nyc_naive.df$host_response_rate <- cut(
nyc_naive.df$host_response_rate,
breaks = c(0, 50, 75, 95, 100),
labels = c("never", "sometimes", "most times", "always"),
include.lowest = TRUE,
right = TRUE
)
summary(nyc_naive.df$host_response_rate)
## never sometimes most times always
## 442 711 2280 14423
#created three outcome classes for cancellation policy
nyc_naive.df$cancellation_policy[nyc_naive.df$cancellation_policy %in% c("super_strict_30", "super_strict_60")] <- "strict"
nyc_naive.df$cancellation_policy<-as.factor(nyc_naive.df$cancellation_policy)
summary(nyc_naive.df$cancellation_policy)
## flexible moderate strict
## 3333 4784 9739
str(nyc_naive.df)
## 'data.frame': 17856 obs. of 29 variables:
## $ id : int 6304928 7919400 5578513 18224863 16679342 14122244 14490881 4734170 421056 15705862 ...
## $ log_price : num 5.13 4.98 4.61 4.6 3.69 ...
## $ property_type : Factor w/ 24 levels "Apartment","Bed & Breakfast",..: 1 1 1 15 15 1 1 1 1 1 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 1 1 2 1 2 1 1 2 2 2 ...
## $ amenities : Factor w/ 16951 levels "{\"Air conditioning\",\"Pets allowed\",Heating,\"Smoke detector\",\"Carbon monoxide detector\",Essentials,Shamp"| __truncated__,..: 2081 6325 4216 11930 3219 8290 6359 15925 7182 6076 ...
## $ accommodates : int 7 5 2 8 2 6 4 3 2 2 ...
## $ bathrooms : num 1 1 1 1 1.5 1 1 1 1 1 ...
## $ bed_type : Factor w/ 5 levels "Airbed","Couch",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ cancellation_policy : Factor w/ 3 levels "flexible","moderate",..: 3 2 3 3 3 3 3 3 3 2 ...
## $ cleaning_fee : chr "True" "True" "True" "True" ...
## $ city : chr "NYC" "NYC" "NYC" "NYC" ...
## $ description : Factor w/ 17724 levels " Great sunny second floor apartment in Brooklyn, Bedstuy. The building is a new construction with eleva"| __truncated__,..: 4835 13822 14838 6306 4283 9109 14380 10118 13613 5805 ...
## $ first_review : chr "2017-08-05" "2017-04-30" "2013-04-28" "2017-07-30" ...
## $ host_has_profile_pic : chr "t" "t" "t" "t" ...
## $ host_identity_verified: chr "f" "t" "t" "f" ...
## $ host_response_rate : Factor w/ 4 levels "never","sometimes",..: 4 4 4 4 4 4 4 2 2 3 ...
## $ host_since : chr "2017-06-19" "2016-10-25" "2013-03-27" "2017-05-09" ...
## $ instant_bookable : chr "t" "t" "f" "t" ...
## $ last_review : chr "2017-09-23" "2017-09-14" "2016-11-15" "2017-09-15" ...
## $ latitude : num 40.8 40.8 40.7 40.6 40.7 ...
## $ longitude : num -74 -73.9 -74 -74 -73.9 ...
## $ name : Factor w/ 17754 levels " 1 Bed Apt in Utopic Williamsburg ",..: 16396 16545 9396 891 16585 9093 10269 3935 11114 9042 ...
## $ neighbourhood : Factor w/ 193 levels "Allerton","Alphabet City",..: 85 84 2 63 53 93 68 127 99 22 ...
## $ number_of_reviews : int 6 10 82 5 57 14 4 40 1 17 ...
## $ review_scores_rating : num 93 92 93 72 89 100 100 88 100 96 ...
## $ thumbnail_url : Factor w/ 17854 levels "https://a0.muscache.com/im/pictures/0000ec1e-e6de-4c4b-b244-be89b344267b.jpg?aki_policy=small",..: 4461 8985 2835 16641 7300 516 15027 10440 4453 2561 ...
## $ zipcode : Factor w/ 260 levels "10001","10001.0",..: 29 38 14 180 155 231 222 25 177 154 ...
## $ bedrooms : num 3 1 1 3 1 3 2 1 1 1 ...
## $ beds : num 3 3 1 3 1 3 2 2 1 1 ...
The variables log_price and beds were filtered to include only rows with values greater than 0. The numeric variables were binned, and different binning strategies were tested against the model's accuracy. For example, log_price was initially binned at breakpoints taken from its summary statistics, but this produced skewed bins.
#Convert num var to factors
#ignoring ID b/c it's just an identifier for the apt and shouldn't have an effect on the price
#ignoring latitude and longitude since factoring zip code & neighborhood
#convert log_price to factors
summary(nyc_naive.df$log_price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.248 4.654 4.719 5.165 7.600
price <- nyc_naive.df %>%filter(log_price <1)
price
## id log_price property_type room_type
## 1 17972519 0 Condominium Shared room
## amenities
## 1 {TV,"Wireless Internet","Air conditioning",Kitchen,Elevator,"Indoor fireplace","Buzzer/wireless intercom",Heating,"Smoke detector","Carbon monoxide detector",Iron,"Host greets you"}
## accommodates bathrooms bed_type cancellation_policy cleaning_fee city
## 1 1 1 Real Bed strict True NYC
## description
## 1 Comfortable dwellings .. My comfy home is for guest travelers and college students on a budget..!! I'm a 3min walk to the 6 train which is 30 mins away from Grand Central Station, Times Square, 34th St and a lot of other tourist areas.. Most of my neighbours are homeowners and are friendly and keep to themselves.. Any questions feel free to ponder then ask.. Comfy dwellings that will make you feel at home.. 40" Mounted TV with Amazon Fire-Stick including Netflix, Hulu, Youtube, Music etc. Access to the kitchen, bathroom and livingroom.. Microwave, toaster, utensils and cutlery to accommodate your needs.. Fresh towels and wash cloths provided.. My cohost and I will provide assistance as needed throughout your stay.. My neighbourhood is located in East-Central part of the Bronx which is 30 mins from Midtown Manhattan.. There is the convenience of several local bus lines that serve my neighbourhood and the 6 train that serves the community.. Macy's, Applebees, IHOP, department stores as
## first_review host_has_profile_pic host_identity_verified host_response_rate
## 1 2017-08-17 t f always
## host_since instant_bookable last_review latitude longitude name
## 1 2017-07-20 t 2017-10-01 40.83805 -73.85867 Esteem's Place
## neighbourhood number_of_reviews review_scores_rating
## 1 Parkchester 3 87
## thumbnail_url
## 1 https://a0.muscache.com/im/pictures/71bc085f-b869-4cda-8b4f-dca694ebbd26.jpg?aki_policy=small
## zipcode bedrooms beds
## 1 10462 1 1
beds <-nyc_naive.df %>%filter(beds <1)
beds
## [1] id log_price property_type
## [4] room_type amenities accommodates
## [7] bathrooms bed_type cancellation_policy
## [10] cleaning_fee city description
## [13] first_review host_has_profile_pic host_identity_verified
## [16] host_response_rate host_since instant_bookable
## [19] last_review latitude longitude
## [22] name neighbourhood number_of_reviews
## [25] review_scores_rating thumbnail_url zipcode
## [28] bedrooms beds
## <0 rows> (or 0-length row.names)
summary(nyc_data_clean$log_price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.248 4.654 4.719 5.165 7.600
#remove the single row with log_price of 0; the beds filter drops nothing (no rows had beds < 1)
nyc_naive.df <- nyc_naive.df %>% filter(log_price > 0)
nyc_naive.df <- nyc_naive.df%>% filter(beds > 0)
summary(nyc_naive.df$log_price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.609 4.248 4.654 4.720 5.165 7.600
#table(nyc_naive.df$log_price)
nyc_naive.df$log_price<-cut(nyc_naive.df$log_price,breaks=
quantile(nyc_naive.df$log_price,
probs=seq(0,1,by=0.25),
na.rm=TRUE),labels=c("Student Budget","Below Average","Above Average", "Pricey Digs"),include.lowest=TRUE)
#initial binning using summary() stats
#nyc_naive.df$log_price<- cut(nyc_naive.df$
#log_price,breaks =c(1,4.248,4.654,5.165,8),
#labels=c("Student Budget","Below Average",
#"Above Average", "Pricey Digs"),
#include.lowest=TRUE)
summary(nyc_naive.df$log_price)
## Student Budget Below Average Above Average Pricey Digs
## 4767 4177 4672 4239
#convert accommodates to factors
summary(nyc_naive.df$accommodates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 3.034 4.000 16.000
#table(nyc_naive.df$accommodates)
#orig binning based off summary()stats
#nyc_naive.df$accommodates<- cut(nyc_naive.df$accommodates,breaks =c(1,2,3,4,16),
#labels=c("studio", "small", "medium", "large"),
#include.lowest=TRUE)
nyc_naive.df$accommodates <- cut2(nyc_naive.df$accommodates, g = 5)
levels(nyc_naive.df$accommodates) <- c("studio", "small", "med", "large")
summary(nyc_naive.df$accommodates)
## studio small med large
## 10302 2059 2772 2722
#convert bathrooms to factors
#orig binning attempt based on summary() stats
summary(nyc_naive.df$bathrooms)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.137 1.000 6.000
#nyc_naive.df$bathrooms<- cut(nyc_naive.df$bathrooms,breaks =c(-1,1,2.5,8),
#labels=c("one or less","two", "multiple"),
#include.lowest=TRUE)
nyc_naive.df$bathrooms <- cut2(nyc_naive.df$bathrooms, g = 2)
#levels(nyc_naive.df$bathrooms)
levels(nyc_naive.df$bathrooms) <- c("one or less", "multiple")
summary(nyc_naive.df$bathrooms)
## one or less multiple
## 15397 2458
#convert reviews to factors
summary(nyc_naive.df$number_of_reviews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 14.00 28.76 37.00 465.00
table(nyc_naive.df$review_scores_rating)
##
## 20 27 40 47 50 53 55 56 57 60 62 63 64 65 66 67
## 25 1 17 2 6 1 1 1 2 109 1 2 3 10 1 16
## 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
## 7 2 46 3 9 45 16 40 25 28 36 33 529 55 81 140
## 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
## 169 229 232 386 418 467 920 635 820 1279 1023 1215 1330 1252 1232 661
## 100
## 4294
nyc_naive.df$number_of_reviews<- cut(nyc_naive.df$number_of_reviews,breaks =c(0,1,5,20,474),
labels=c("none","low", "average", "high"),
include.lowest=TRUE)
summary(nyc_naive.df$number_of_reviews)
## none low average high
## 1351 3904 5525 7075
#convert ratings to factors
summary(nyc_naive.df$review_scores_rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 91.00 95.00 93.76 99.00 100.00
nyc_naive.df$review_scores_rating<- cut(nyc_naive.df$review_scores_rating,breaks =c(20,91,96,100),
labels=c("low","high", "top"),
include.lowest=TRUE)
#tried binning this way but received an error
#nyc_naive.df$review_scores_rating <- cut2(nyc_naive.df$review_scores_rating, g = 3)
#levels(nyc_naive.df$review_scores_rating) <- c("low", "high", "top")
#summary(nyc_naive.df$review_scores_rating)
#bedrooms to factors
#summary(nyc_naive.df$bedrooms)
nyc_naive.df$bedrooms<- cut(nyc_naive.df$bedrooms,breaks =c(-1,0,1,10),
labels=c("studio","one", "multiple"),
include.lowest=TRUE)
summary(nyc_naive.df$beds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.667 2.000 18.000
nyc_naive.df$beds<- cut(nyc_naive.df$beds,breaks =c(0,1,2,18),
labels=c("one","two","many"),
include.lowest=TRUE)
summary(nyc_naive.df$beds)
## one two many
## 11123 3946 2786
The logical variables in the data set were converted to factors as well.
#turn log var to factors
nyc_naive.df$host_identity_verified<-as.factor(nyc_naive.df$
host_identity_verified)
nyc_naive.df$cleaning_fee <- as.factor(nyc_naive.df$cleaning_fee)
nyc_naive.df$host_has_profile_pic <- as.factor(nyc_naive.df$
host_has_profile_pic)
nyc_naive.df$instant_bookable <- as.factor(nyc_naive.df$instant_bookable)
The dataset was partitioned into a training data frame with 60% of the data and a validation data frame with 40% of the data.
#split data 60/40
set.seed(1)
train_2.index <- sample(rownames(nyc_naive.df),dim(nyc_naive.df)[1]*0.6)
valid_2.index <- setdiff(rownames(nyc_naive.df), train_2.index)
train_2.df <- nyc_naive.df[train_2.index, ]
valid_2.df <- nyc_naive.df[valid_2.index, ]
Bar plots were used to explore the relationships between predictor variables and price categories, providing insight into which predictors were most relevant for inclusion in the Naive Bayes model.
#bar plots to view relationships between predictors and price
ggplot(train_2.df,aes(x=room_type, fill = log_price))+
geom_bar(position="fill") +
labs(title="Price by Room Type", y = "Proportion" ) +
theme_minimal()
#room type seems to have an effect on price
Room type appears to affect listing price. For example, entire homes and apartments have a higher share of pricey listings than private and shared rooms, while the student budget and below average price levels dominate among private and shared rooms.
ggplot(train_2.df,aes(x=accommodates, fill = log_price))+
geom_bar(position="fill") +
labs(title="Price by Accommodates", y = "Proportion" ) +
theme_minimal()
#The amount of people seems to have an effect on price
The number of people a listing can accommodate also appears to affect price: listings that accommodate more people skew more expensive than those with lower capacities.
ggplot(train_2.df,aes(x=bathrooms, fill = log_price))+
geom_bar(position="fill") +
labs(title="Price by # of Bathrooms", y = "Proportion" ) +
theme_minimal()
#Number of bathrooms seems to have an effect on price
The number of bathrooms in a listing appears to have less influence on price compared to other variables. Interestingly, listings with one or fewer bathrooms show a higher proportion of above average price listings than those with multiple bathrooms. This may be explained by the greater representation of one bathroom listings in the dataset, which could be skewing the observed relationship.
ggplot(train_2.df,aes(x=host_identity_verified, fill = log_price))+
geom_bar(position="fill") +
labs(title="Price by Host Verification", y = "Proportion" ) +
theme_minimal()
#host verification doesn't seem to have much effect on price
Host identity verification doesn’t seem to have much of an effect on listing prices. All price levels are almost equally proportioned whether or not the host is verified, so host verification was not chosen for the Naive Bayes model.
ggplot(train_2.df,aes(x=instant_bookable, fill = log_price))+
geom_bar(position="fill") +
labs(title="Price by Booking Option", y = "Proportion" ) +
theme_minimal()
#book option doesn't seem to have much effect on price
Booking options also appear to have minimal impact on listing price. All price categories show similar proportions across the different booking options, suggesting that this variable does not strongly influence pricing.
ggplot(train_2.df,aes(x=review_scores_rating, fill = log_price))+
geom_bar(position="fill") +
labs(title="Price by review ratings", y = "Proportion" ) +
theme_minimal()
#review scores doesn't seems to have much effect on price
Review ratings also seem to have little impact on listing price.
ggplot(train_2.df,aes(x=bedrooms, fill = log_price))+
geom_bar(position="fill") +
labs(title="Price by # of Bedrooms", y = "Proportion" ) +
theme_minimal()
#number bedrooms seems to have an effect on price
The number of bedrooms in a listing appears to have a stronger influence on price compared to other variables. For instance, listings with multiple bedrooms are less commonly found in the student budget category, whereas one bedroom listings are more prevalent. Interestingly, studio apartments do not show as much representation in the student budget category.
ggplot(train_2.df,aes(x=beds, fill = log_price))+
geom_bar(position="fill") +
labs(title="Price by # of Beds", y = "Proportion" ) +
theme_minimal()
#number of beds seems to have an effect on price
The number of beds also appears to have a larger impact on listing price than most other variables.
ggplot(train_2.df,aes(x=cancellation_policy, fill = log_price))+
geom_bar(position="fill") +
labs(title="Price by cancellation policy", y = "Proportion" ) +
theme_minimal()
#cancellation policy seems to have small effect on price
The cancellation policy has a smaller impact on the listing price compared to other variables.
ggplot(train_2.df,aes(x=cleaning_fee, fill = log_price))+
geom_bar(position="fill") +
labs(title="Price by cleaning fee", y = "Proportion" ) +
theme_minimal()
#cleaning fee seems to have a small effect on price
Cleaning fee appears to have a smaller impact on listing price than other variables.
ggplot(train_2.df,aes(x=host_response_rate, fill = log_price))+
geom_bar(position="fill") +
labs(title="Price by host response rate", y = "Proportion" ) +
theme_minimal()
#host response rate seems to not have an effect on price
Host response rate doesn’t seem to have much of an impact on the price of a listing.
#selecting five variables
train_2.df <- select(train_2.df,log_price,room_type,accommodates
,bathrooms,bedrooms,cancellation_policy)
valid_2.df <-select(valid_2.df,log_price,room_type,accommodates
, bathrooms,bedrooms,cancellation_policy)
#initially chose room_type,review_scores_rating
#bathrooms,beds, cleaning_fee with 49% accuracy rate
#room_type,review_scores_rating,bathrooms,bedrooms, cleaning_fee accuracy of 52%
#room_type,review_scores_rating,bathrooms,bedrooms,cancellation_policy 52%
#room_type,accommodates, bathrooms,bedrooms,cancellation_policy 52%
#check
names(train_2.df)
## [1] "log_price" "room_type" "accommodates"
## [4] "bathrooms" "bedrooms" "cancellation_policy"
names(valid_2.df)
## [1] "log_price" "room_type" "accommodates"
## [4] "bathrooms" "bedrooms" "cancellation_policy"
#run naive bayes
price.nb <- naiveBayes(log_price ~ ., data = train_2.df)
price.nb
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Student Budget Below Average Above Average Pricey Digs
## 0.2648184 0.2356950 0.2577243 0.2417623
##
## Conditional probabilities:
## room_type
## Y Entire home/apt Private room Shared room
## Student Budget 0.0440606274 0.8829749736 0.0729643990
## Below Average 0.3120792079 0.6689108911 0.0190099010
## Above Average 0.7547989859 0.2404925752 0.0047084390
## Pricey Digs 0.9528957529 0.0467181467 0.0003861004
##
## accommodates
## Y studio small med large
## Student Budget 0.86253084 0.08248149 0.04194572 0.01304195
## Below Average 0.73940594 0.10613861 0.11326733 0.04118812
## Above Average 0.47736327 0.15682724 0.22600507 0.13980442
## Pricey Digs 0.20579151 0.12123552 0.26061776 0.41235521
##
## bathrooms
## Y one or less multiple
## Student Budget 0.85618611 0.14381389
## Below Average 0.91247525 0.08752475
## Above Average 0.93734154 0.06265846
## Pricey Digs 0.74092664 0.25907336
##
## bedrooms
## Y studio one multiple
## Student Budget 0.02502644 0.94888967 0.02608389
## Below Average 0.08118812 0.84633663 0.07247525
## Above Average 0.14885911 0.65447302 0.19666787
## Pricey Digs 0.09459459 0.39266409 0.51274131
##
## cancellation_policy
## Y flexible moderate strict
## Student Budget 0.2428622 0.2759958 0.4811421
## Below Average 0.2344554 0.2942574 0.4712871
## Above Average 0.1647954 0.2966317 0.5385730
## Pricey Digs 0.1135135 0.2154440 0.6710425
#training data
confusionMatrix(predict(price.nb,newdata=train_2.df),train_2.df$log_price)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Student Budget Below Average Above Average Pricey Digs
## Student Budget 2544 1596 584 85
## Below Average 147 172 145 66
## Above Average 121 578 1418 1000
## Pricey Digs 25 179 614 1439
##
## Overall Statistics
##
## Accuracy : 0.5202
## 95% CI : (0.5107, 0.5297)
## No Information Rate : 0.2648
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3547
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: Student Budget Class: Below Average
## Sensitivity 0.8967 0.06812
## Specificity 0.7124 0.95628
## Pos Pred Value 0.5290 0.32453
## Neg Pred Value 0.9504 0.76893
## Prevalence 0.2648 0.23569
## Detection Rate 0.2375 0.01606
## Detection Prevalence 0.4489 0.04947
## Balanced Accuracy 0.8046 0.51220
## Class: Above Average Class: Pricey Digs
## Sensitivity 0.5136 0.5556
## Specificity 0.7863 0.8993
## Pos Pred Value 0.4549 0.6376
## Neg Pred Value 0.8232 0.8639
## Prevalence 0.2577 0.2418
## Detection Rate 0.1324 0.1343
## Detection Prevalence 0.2910 0.2107
## Balanced Accuracy 0.6500 0.7274
# holdout data
confusionMatrix(predict(price.nb, newdata=valid_2.df), valid_2.df$log_price)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Student Budget Below Average Above Average Pricey Digs
## Student Budget 1710 1044 388 61
## Below Average 98 103 99 43
## Above Average 98 382 972 615
## Pricey Digs 24 123 452 930
##
## Overall Statistics
##
## Accuracy : 0.5202
## 95% CI : (0.5085, 0.5318)
## No Information Rate : 0.2702
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3523
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: Student Budget Class: Below Average
## Sensitivity 0.8860 0.06235
## Specificity 0.7135 0.95628
## Pos Pred Value 0.5339 0.30029
## Neg Pred Value 0.9441 0.77217
## Prevalence 0.2702 0.23131
## Detection Rate 0.2394 0.01442
## Detection Prevalence 0.4485 0.04803
## Balanced Accuracy 0.7998 0.50932
## Class: Above Average Class: Pricey Digs
## Sensitivity 0.5086 0.5640
## Specificity 0.7907 0.8910
## Pos Pred Value 0.4702 0.6082
## Neg Pred Value 0.8150 0.8719
## Prevalence 0.2676 0.2309
## Detection Rate 0.1361 0.1302
## Detection Prevalence 0.2894 0.2141
## Balanced Accuracy 0.6497 0.7275
price.nb$tables$room_type
## room_type
## Y Entire home/apt Private room Shared room
## Student Budget 0.0440606274 0.8829749736 0.0729643990
## Below Average 0.3120792079 0.6689108911 0.0190099010
## Above Average 0.7547989859 0.2404925752 0.0047084390
## Pricey Digs 0.9528957529 0.0467181467 0.0003861004
## predict probabilities
pred.prob <- predict(price.nb, newdata=valid_2.df, type="raw")
## predict class membership
pred.class <- predict(price.nb, newdata=valid_2.df)
df_2 <- data.frame(actual=valid_2.df$log_price, predicted=pred.class, pred.prob)
test_apt <- data.frame(cancellation_policy="moderate", accommodates="med", #"med" matches the binned factor level
room_type = "Entire home/apt", bathrooms ="multiple",
bedrooms= "multiple")
pred_prob <- predict(price.nb, newdata=test_apt, type="raw")
pred_class <- predict(price.nb, newdata = test_apt)
pred_prob
## Student Budget Below Average Above Average Pricey Digs
## [1,] 0.001620732 0.01842043 0.09540119 0.8845576
pred_class
## [1] Pricey Digs
## Levels: Student Budget Below Average Above Average Pricey Digs
#predicted Pricey Digs
Lift charts were used to evaluate the model’s performance relative to random guessing. The results indicate that, for each price category, the model consistently outperforms random selection.
#Lift chart for Pricey Digs
actual_pricey <- factor(ifelse(valid_2.df$log_price == "Pricey Digs", "yes","no"
))
prob_pricey <-pred.prob[,"Pricey Digs"]
lift_pricey<- lift(actual_pricey ~ prob_pricey,data=data.frame(actual_pricey,prob_pricey), class="yes")
plot(lift_pricey,main = "Lift Chart for Predicting Pricey Digs")
#lift chart for below average
actual_below<- factor(ifelse(valid_2.df$log_price == "Below Average", "yes","no"))
prob_below <-pred.prob[,"Below Average"]
lift_below<- lift(actual_below ~ prob_below,data=data.frame(actual_below,prob_below), class="yes")
plot(lift_below,main = "Lift Chart for Predicting Below Average")
#lift chart for student budget
actual_budget <- factor(ifelse(valid_2.df$log_price == "Student Budget", "yes","no"))
prob_budget <-pred.prob[,"Student Budget"]
lift_budget<- lift(actual_budget ~ prob_budget,data=data.frame(actual_budget,prob_budget), class="yes")
plot(lift_budget,main = "Lift Chart for Predicting Student Budget")
#lift chart for above average
actual_above <- factor(ifelse(valid_2.df$log_price == "Above Average", "yes","no"))
prob_above <-pred.prob[,"Above Average"]
lift_above<- lift(actual_above ~ prob_above,data=data.frame(actual_above,prob_above), class="yes")
plot(lift_above,main = "Lift Chart for Predicting Above Average")
A Naive Bayes model was built to classify listings into four price categories: student budget, below average, above average, and pricey digs. Character and logical variables were converted to factors, and numeric variables were binned either at breakpoints from their summary statistics or by equal frequency. Because the log_price distribution was skewed, it too was binned by equal frequency. The five variables selected as predictors for the final model were room type, accommodates, bathrooms, bedrooms, and cancellation policy.
Despite experimenting with different binning techniques and variable combinations, the model’s accuracy stayed between 49% and 52%. Lift chart analysis nevertheless confirms that it outperforms random guessing, and its accuracy on the validation set closely matched that on the training set, indicating stable performance. A fictional apartment listing was created with the following attributes: room type Entire home/apt, bathrooms multiple, bedrooms multiple, accommodates med, and cancellation policy moderate. The model classified this fictional listing as “Pricey Digs.”
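The trial-and-error over predictor sets described above can also be automated. A minimal sketch with `e1071::naiveBayes()` on a synthetic data frame (the variable names and candidate sets here are illustrative, not the project's actual columns):

```r
library(e1071)

set.seed(1)
# Toy stand-in: a factor outcome and a few factor predictors
toy <- data.frame(
  price_bin = factor(sample(c("low", "mid", "high"), 300, replace = TRUE)),
  room      = factor(sample(c("entire", "private", "shared"), 300, replace = TRUE)),
  baths     = factor(sample(c("one", "multi"), 300, replace = TRUE)),
  beds      = factor(sample(c("one", "two", "many"), 300, replace = TRUE))
)
idx   <- sample(nrow(toy), 0.6 * nrow(toy))
train <- toy[idx, ]; valid <- toy[-idx, ]

# Candidate predictor sets, compared on holdout accuracy
sets <- list(c("room", "baths"), c("room", "beds"), c("room", "baths", "beds"))
acc  <- sapply(sets, function(vars) {
  f   <- reformulate(vars, response = "price_bin")
  fit <- naiveBayes(f, data = train)
  mean(predict(fit, newdata = valid) == valid$price_bin)
})
names(acc) <- sapply(sets, paste, collapse = "+")
acc
```

Looping like this makes the 49%–52% comparisons reproducible instead of relying on commented-out notes.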
#classification tree
levels(nyc_data_clean$cancellation_policy)
## NULL
summary(nyc_data_clean$cancellation_policy)
## Length Class Mode
## 17856 character character
#str(nyc_data_clean)
#remove neighbourhood, thumbnail, zipcode, name, id, dates, description, lat, long
#b/c too many levels & chr var
#factor bed_type and host_response rate, prop type, room_type
nyc_tree.df <- nyc_data_clean %>%
# Drop identifier, free-text, date, and location columns.
select(-c(id,city,description,amenities,first_review,host_since,last_review,
name,neighbourhood,thumbnail_url,zipcode, latitude, longitude))
#str(nyc_tree.df)
#table(nyc_tree.df$host_response_rate)
nyc_tree.df$host_response_rate <- as.numeric(gsub("%", "", nyc_tree.df$host_response_rate))
nyc_tree.df$host_response_rate<-cut(nyc_tree.df$
host_response_rate,breaks=c(0,50,75,
100),
labels=c("rarely","sometimes","always"),
include.lowest=TRUE)
summary(nyc_tree.df$host_response_rate)
## rarely sometimes always
## 442 711 16703
#bed_type,room_type,property_type from chr to factor
nyc_tree.df$bed_type<-as.factor(nyc_tree.df$bed_type)
#table(nyc_tree.df$room_type)
nyc_tree.df$room_type<-as.factor(nyc_tree.df$room_type)
#table(nyc_tree.df$property_type)
nyc_tree.df$property_type<-as.factor(nyc_tree.df$property_type)
#bin cancellation policy to: flexible, moderate, strict
#nyc_tree.df$cancellation_policy <-as.character(nyc_tree.df$cancellation_policy)
nyc_tree.df$cancellation_policy[nyc_tree.df$cancellation_policy %in% c("super_strict_30", "super_strict_60")] <- "strict"
nyc_tree.df$cancellation_policy<-as.factor(nyc_tree.df$cancellation_policy)
#turn log var to factors
nyc_tree.df$host_identity_verified<-as.factor(nyc_tree.df$
host_identity_verified)
nyc_tree.df$cleaning_fee <- as.factor(nyc_tree.df$cleaning_fee)
nyc_tree.df$host_has_profile_pic <- as.factor(nyc_tree.df$
host_has_profile_pic)
nyc_tree.df$instant_bookable <- as.factor(nyc_tree.df$instant_bookable)
#Partition data
str(nyc_tree.df)
## 'data.frame': 17856 obs. of 16 variables:
## $ log_price : num 5.13 4.98 4.61 4.6 3.69 ...
## $ property_type : Factor w/ 24 levels "Apartment","Bed & Breakfast",..: 1 1 1 15 15 1 1 1 1 1 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 1 1 2 1 2 1 1 2 2 2 ...
## $ accommodates : int 7 5 2 8 2 6 4 3 2 2 ...
## $ bathrooms : num 1 1 1 1 1.5 1 1 1 1 1 ...
## $ bed_type : Factor w/ 5 levels "Airbed","Couch",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ cancellation_policy : Factor w/ 3 levels "flexible","moderate",..: 3 2 3 3 3 3 3 3 3 2 ...
## $ cleaning_fee : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ host_has_profile_pic : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ host_identity_verified: Factor w/ 2 levels "f","t": 1 2 2 1 2 2 1 1 2 2 ...
## $ host_response_rate : Factor w/ 3 levels "rarely","sometimes",..: 3 3 3 3 3 3 3 2 2 3 ...
## $ instant_bookable : Factor w/ 2 levels "f","t": 2 2 1 2 1 1 1 1 1 2 ...
## $ number_of_reviews : int 6 10 82 5 57 14 4 40 1 17 ...
## $ review_scores_rating : num 93 92 93 72 89 100 100 88 100 96 ...
## $ bedrooms : num 3 1 1 3 1 3 2 1 1 1 ...
## $ beds : num 3 3 1 3 1 3 2 2 1 1 ...
summary(nyc_tree.df$cancellation_policy)
## flexible moderate strict
## 3333 4784 9739
set.seed(1)
idx <- createDataPartition(nyc_tree.df$cancellation_policy, p=0.6, list=FALSE)
train_tree.df <- nyc_tree.df[idx, ]
holdout_tree.df <- nyc_tree.df[-idx, ]
#tree 1
model1 <- rpart(cancellation_policy ~ ., data = train_tree.df,cp=0.001,method = "class")
# plot tree
rpart.plot(model1, extra=1, fallen.leaves=FALSE)
# second plotting option for model1
#rpart.plot(model1, extra = 104,
#branch.lty = 4, shadow.col = "gray", nn = TRUE)
Model 1 used the smallest cp value, but it has many nodes, potentially making it more complex than needed and prone to overfitting.
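The specific cp values tried for the later trees look like entries read off a complexity table; inspecting `cptable` and pruning is the usual route with `rpart`. A short sketch on a built-in dataset (not the project data):

```r
library(rpart)

# Grow a deliberately deep tree, then inspect cross-validated error per cp value
fit <- rpart(Species ~ ., data = iris, method = "class", cp = 0.001)
printcp(fit)   # columns: CP, nsplit, rel error, xerror, xstd

# Pick the cp with the lowest cross-validated error and prune back to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

Pruning at the xerror-minimizing cp (or the smallest tree within one standard error of it) is a standard guard against the overfitting noted above.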
#tree 2
model2 <-rpart(cancellation_policy ~ ., data = train_tree.df,xval=5,
cp= 0.0024636,method = "class")
rpart.plot(model2, extra=1, fallen.leaves=FALSE)
#2nd option to plot model2
#rpart.plot(model2, extra = 104,
#branch.lty = 4, shadow.col = "gray", nn = TRUE)
The classification tree plotted from model2 doesn’t include all of the outcome variable’s categories.
model3 <-rpart(cancellation_policy ~ ., data = train_tree.df,xval=5,
cp=0.0018477 ,method = "class")
rpart.plot(model3, extra=1, fallen.leaves=FALSE)
Compared to Model 1, Model 3 was simpler in structure. While it captured all three categories of the outcome variable, this separation occurred only at the last split.
#hyperparameter tuning
control <- trainControl(method="repeatedcv", number=7, search="grid", classProbs
= TRUE, summaryFunction = multiClassSummary)
tunegrid <- expand.grid(cp=seq(from = 0.0001, to = 0.01, by=0.001))
dt_gridsearch <- train(cancellation_policy~., data=train_tree.df, method="rpart"
, metric="Accuracy", tuneGrid=tunegrid, trControl=control
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(dt_gridsearch)
## CART
##
## 10715 samples
## 15 predictor
## 3 classes: 'flexible', 'moderate', 'strict'
##
## No pre-processing
## Resampling: Cross-Validated (7 fold, repeated 1 times)
## Summary of sample sizes: 9184, 9184, 9185, 9185, 9184, 9184, ...
## Resampling results across tuning parameters:
##
## cp logLoss AUC prAUC Accuracy Kappa Mean_F1
## 0.0001 1.9852869 0.6249356 0.4400766 0.5113402 0.1422480 0.4313485
## 0.0011 0.9688574 0.5771863 0.3692732 0.5705098 0.1353342 0.3821414
## 0.0021 0.9608344 0.5765844 0.2984229 0.5691096 0.1297495 0.3828543
## 0.0031 0.9612210 0.5767632 0.1915386 0.5691096 0.1197092 0.3949533
## 0.0041 0.9619828 0.5764966 0.1905609 0.5678966 0.1195936 0.3949533
## 0.0051 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048 NaN
## 0.0061 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048 NaN
## 0.0071 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048 NaN
## 0.0081 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048 NaN
## 0.0091 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048 NaN
## Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
## 0.4287626 0.7125465 0.4415810 0.7183194
## 0.4157701 0.7043482 0.5071865 0.7653037
## 0.4114668 0.7029044 0.4877401 0.7650505
## 0.4065537 0.6993735 0.4907577 0.7687927
## 0.4070165 0.6993289 0.4907577 0.7647393
## 0.4081123 0.6994388 NaN 0.7654225
## 0.4081123 0.6994388 NaN 0.7654225
## 0.4081123 0.6994388 NaN 0.7654225
## 0.4081123 0.6994388 NaN 0.7654225
## 0.4081123 0.6994388 NaN 0.7654225
## Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.4415810 0.4287626 0.1704467 0.5706545
## 0.5071865 0.4157701 0.1901699 0.5600591
## 0.4877401 0.4114668 0.1897032 0.5571856
## 0.4907577 0.4065537 0.1897032 0.5529636
## 0.4907577 0.4070165 0.1892989 0.5531727
## NaN 0.4081123 0.1894545 0.5537755
## NaN 0.4081123 0.1894545 0.5537755
## NaN 0.4081123 0.1894545 0.5537755
## NaN 0.4081123 0.1894545 0.5537755
## NaN 0.4081123 0.1894545 0.5537755
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0011.
plot(dt_gridsearch)
dt_gridsearch$results
## cp logLoss AUC prAUC Accuracy Kappa Mean_F1
## 1 0.0001 1.9852869 0.6249356 0.4400766 0.5113402 0.1422480 0.4313485
## 2 0.0011 0.9688574 0.5771863 0.3692732 0.5705098 0.1353342 0.3821414
## 3 0.0021 0.9608344 0.5765844 0.2984229 0.5691096 0.1297495 0.3828543
## 4 0.0031 0.9612210 0.5767632 0.1915386 0.5691096 0.1197092 0.3949533
## 5 0.0041 0.9619828 0.5764966 0.1905609 0.5678966 0.1195936 0.3949533
## 6 0.0051 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048 NaN
## 7 0.0061 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048 NaN
## 8 0.0071 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048 NaN
## 9 0.0081 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048 NaN
## 10 0.0091 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048 NaN
## Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
## 1 0.4287626 0.7125465 0.4415810 0.7183194
## 2 0.4157701 0.7043482 0.5071865 0.7653037
## 3 0.4114668 0.7029044 0.4877401 0.7650505
## 4 0.4065537 0.6993735 0.4907577 0.7687927
## 5 0.4070165 0.6993289 0.4907577 0.7647393
## 6 0.4081123 0.6994388 NaN 0.7654225
## 7 0.4081123 0.6994388 NaN 0.7654225
## 8 0.4081123 0.6994388 NaN 0.7654225
## 9 0.4081123 0.6994388 NaN 0.7654225
## 10 0.4081123 0.6994388 NaN 0.7654225
## Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
## 1 0.4415810 0.4287626 0.1704467 0.5706545
## 2 0.5071865 0.4157701 0.1901699 0.5600591
## 3 0.4877401 0.4114668 0.1897032 0.5571856
## 4 0.4907577 0.4065537 0.1897032 0.5529636
## 5 0.4907577 0.4070165 0.1892989 0.5531727
## 6 NaN 0.4081123 0.1894545 0.5537755
## 7 NaN 0.4081123 0.1894545 0.5537755
## 8 NaN 0.4081123 0.1894545 0.5537755
## 9 NaN 0.4081123 0.1894545 0.5537755
## 10 NaN 0.4081123 0.1894545 0.5537755
## logLossSD AUCSD prAUCSD AccuracySD KappaSD Mean_F1SD
## 1 0.146099665 0.005329671 0.006805516 0.010355840 0.01145506 0.009940498
## 2 0.020879851 0.008838694 0.059305894 0.005957622 0.01638366 0.010170023
## 3 0.008400957 0.009307469 0.086011836 0.005536329 0.01583959 0.008534250
## 4 0.008359484 0.008884793 0.006283672 0.005010466 0.01760796 NA
## 5 0.008675200 0.009083181 0.007087852 0.005743311 0.01795763 NA
## 6 0.008211788 0.008278361 0.006483607 0.006403378 0.01773355 NA
## 7 0.008211788 0.008278361 0.006483607 0.006403378 0.01773355 NA
## 8 0.008211788 0.008278361 0.006483607 0.006403378 0.01773355 NA
## 9 0.008211788 0.008278361 0.006483607 0.006403378 0.01773355 NA
## 10 0.008211788 0.008278361 0.006483607 0.006403378 0.01773355 NA
## Mean_SensitivitySD Mean_SpecificitySD Mean_Pos_Pred_ValueSD
## 1 0.008001416 0.003447328 0.014474713
## 2 0.010388683 0.004784540 0.011348196
## 3 0.009945135 0.004812990 0.009424466
## 4 0.010959149 0.005288428 NA
## 5 0.010911660 0.005405074 NA
## 6 0.011612401 0.004974672 NA
## 7 0.011612401 0.004974672 NA
## 8 0.011612401 0.004974672 NA
## 9 0.011612401 0.004974672 NA
## 10 0.011612401 0.004974672 NA
## Mean_Neg_Pred_ValueSD Mean_PrecisionSD Mean_RecallSD Mean_Detection_RateSD
## 1 0.004665379 0.014474713 0.008001416 0.003451947
## 2 0.008788118 0.011348196 0.010388683 0.001985874
## 3 0.008477817 0.009424466 0.009945135 0.001845443
## 4 0.006664328 NA 0.010959149 0.001670155
## 5 0.008440020 NA 0.010911660 0.001914437
## 6 0.009256509 NA 0.011612401 0.002134459
## 7 0.009256509 NA 0.011612401 0.002134459
## 8 0.009256509 NA 0.011612401 0.002134459
## 9 0.009256509 NA 0.011612401 0.002134459
## 10 0.009256509 NA 0.011612401 0.002134459
## Mean_Balanced_AccuracySD
## 1 0.005411302
## 2 0.007561411
## 3 0.007307516
## 4 0.008097106
## 5 0.008134632
## 6 0.008279338
## 7 0.008279338
## 8 0.008279338
## 9 0.008279338
## 10 0.008279338
confusionMatrix(dt_gridsearch)
## Cross-Validated (7 fold, repeated 1 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction flexible moderate strict
## flexible 5.3 2.3 2.9
## moderate 0.3 0.7 0.7
## strict 13.0 23.8 51.0
##
## Accuracy (average) : 0.5705
dt_gridsearch$bestTune
## cp
## 2 0.0011
# cp = 0.0011 is the best complexity-parameter value for the classification tree
Hyperparameter tuning was used to determine the most effective cp value for the classification tree. The final model had a cp value of 0.0011.
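An equivalent route to the same tree is to grow a deliberately deep tree with a tiny cp and prune it back to the selected value. The sketch below illustrates this with the built-in `iris` data standing in for the Airbnb frame (the variable names and cp values here are illustrative, not from the analysis above).

```r
# Sketch (not the original workflow): grow a deep tree with a very small
# cp, then prune back to a chosen cp. iris stands in for the Airbnb data.
library(rpart)
deep_tree <- rpart(Species ~ ., data = iris, method = "class", cp = 0.0001)
pruned_tree <- prune(deep_tree, cp = 0.01)
```

`prune()` snips off splits whose complexity gain falls below the supplied cp, so the pruned tree can never have more nodes than the deep one it came from.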
#final classification tree
model4 <- rpart(cancellation_policy ~ ., data = train_tree.df, xval = 5,
                cp = 0.0011, method = "class")
rpart.plot(model4, extra=1, fallen.leaves=FALSE)
rpart.plot(model4,
type = 2,
extra = 104,
fallen.leaves = TRUE,
cex = 0.5)
# save a high-resolution image of the tree, since it is hard to read in the plot pane
png("model4.png", width = 3000, height = 2000, res = 300)
rpart.plot(model4, type = 2, extra = 104, fallen.leaves = TRUE, cex = 0.5)
dev.off()
## quartz_off_screen
## 2
A classification tree was built to predict a listing’s cancellation policy, binned into three categories: strict, moderate, and flexible. Trees were first plotted at several cp values to see how performance and complexity traded off, and hyperparameter tuning then identified 0.0011 as the most effective cp. This produced a more complex tree with many nodes, but trees grown at higher cp values either excluded the moderate category or only introduced it at the very last split, likely because the moderate class is underrepresented in the data, which makes it harder for simpler trees to include it.
In the end, the model with cp = 0.0011 was selected. While complex, it represents all three cancellation policy categories throughout the tree. Its average cross-validated accuracy is 57%.
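Since the 57% figure is a cross-validated average, a natural next step is a holdout check. The sketch below assumes a validation split named `valid_tree.df` (a hypothetical name, not defined above) with the same columns as `train_tree.df`, and also verifies the class imbalance noted above.

```r
# Class balance: "moderate" is expected to hold the smallest share.
prop.table(table(train_tree.df$cancellation_policy))

# Holdout evaluation (valid_tree.df is a hypothetical validation split).
pred4 <- predict(model4, newdata = valid_tree.df, type = "class")
confusionMatrix(pred4, valid_tree.df$cancellation_policy)
```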
# Create a subset of neighborhoods
# View neighborhood names
unique_neighborhoods <- nyc_data %>%
filter(!is.na(neighbourhood)) %>%
distinct(neighbourhood) %>%
pull(neighbourhood)
# Select n random neighborhoods (no seed is set here, so the sample
# changes between runs)
n <- 60 #randomly select n neighborhoods
selected_neighborhoods <- sample(unique_neighborhoods, n)
# Drop NA values
nyc_data_clean <- nyc_data %>%
drop_na()
# Create subset
nyc_subset <- nyc_data_clean %>%
filter(neighbourhood %in% selected_neighborhoods)
# Feature Engineering and Summarization
# Add a column for price
nyc_engineered <- nyc_subset %>%
mutate(
price = exp(log_price),
price_per_person = price / accommodates,
bedroom_to_guest_ratio = bedrooms / accommodates,
has_cleaning_fee = ifelse(cleaning_fee == "True", 1, 0),
is_verified_host = ifelse(host_identity_verified == "t", 1, 0)
)
neighborhood_summary <- nyc_engineered %>%
group_by(neighbourhood) %>%
summarise(
avg_price = mean(price, na.rm = TRUE),
avg_beds = mean(beds, na.rm = TRUE),
avg_reviews = mean(number_of_reviews, na.rm = TRUE),
avg_review_score = mean(review_scores_rating, na.rm = TRUE),
price_per_person = mean(price_per_person, na.rm = TRUE),
bedroom_to_guest_ratio = mean(bedroom_to_guest_ratio, na.rm = TRUE),
pct_cleaning_fee = mean(has_cleaning_fee, na.rm = TRUE),
pct_verified_host = mean(is_verified_host, na.rm = TRUE),
listing_count = n()
) %>%
column_to_rownames("neighbourhood")
head(neighborhood_summary)
## avg_price avg_beds avg_reviews avg_review_score
## Annadale 106.00000 2.000000 49.00000 99.00000
## Astoria 100.91340 1.618557 31.97113 93.96289
## Battery Park City 186.42857 1.000000 13.28571 96.57143
## Bedford Park 60.83333 1.500000 25.00000 92.66667
## Bensonhurst 85.19231 2.038462 18.19231 92.26923
## Brooklyn Heights 171.59615 1.653846 21.21154 96.00000
## price_per_person bedroom_to_guest_ratio pct_cleaning_fee
## Annadale 26.50000 0.2500000 0.0000000
## Astoria 38.78780 0.5004849 0.7567010
## Battery Park City 93.21429 0.2142857 1.0000000
## Bedford Park 27.17361 0.4791667 0.9166667
## Bensonhurst 39.53388 0.5279762 0.7692308
## Brooklyn Heights 61.36635 0.3883814 0.7692308
## pct_verified_host listing_count
## Annadale 0.0000000 1
## Astoria 0.6515464 485
## Battery Park City 0.4285714 7
## Bedford Park 1.0000000 12
## Bensonhurst 0.4230769 26
## Brooklyn Heights 0.7307692 52
# Standardize features
neighborhood_scaled <- scale(neighborhood_summary)
# Remove NAs and zero-variance columns
neighborhood_scaled_clean <- as.data.frame(neighborhood_scaled) %>%
na.omit()
neighborhood_scaled_clean <- neighborhood_scaled_clean[, apply(neighborhood_scaled_clean, 2, sd, na.rm = TRUE) != 0]
# Determine optimal number of clusters (elbow method)
library(factoextra) # provides fviz_nbclust() and fviz_cluster()
fviz_nbclust(neighborhood_scaled_clean, kmeans, method = "wss")
# Perform K-means clustering
set.seed(123)
k <- 3
kmeans_result <- kmeans(neighborhood_scaled_clean, centers = k, nstart = 50)
# Assign cluster labels (ensure alignment)
neighborhood_summary_clean <- neighborhood_summary[rownames(neighborhood_scaled_clean), ]
neighborhood_summary_clean$cluster <- as.factor(kmeans_result$cluster)
# Visualize clusters using PCA projection of standardized data
fviz_cluster(kmeans_result,
data = neighborhood_scaled_clean,
geom = "point", repel = TRUE, labelsize = 8,
main = "K-Means Clustering of NYC Neighborhoods",
ellipse.type = "convex")
# Create a new column to join back neighborhood names
neighborhood_summary_named <- neighborhood_summary_clean %>%
rownames_to_column("neighbourhood")
# Create cluster profile summary
cluster_profile <- neighborhood_summary_named %>%
group_by(cluster) %>%
summarise(
avg_price = round(mean(avg_price), 2),
avg_reviews = round(mean(avg_reviews), 1),
avg_review_score = round(mean(avg_review_score), 1),
bedroom_to_guest_ratio = round(mean(bedroom_to_guest_ratio), 2),
pct_cleaning_fee = round(mean(pct_cleaning_fee), 2),
avg_listings_per_neighborhood = round(mean(listing_count), 1),
n_neighborhoods = n()
)
# Print table
print(cluster_profile)
## # A tibble: 3 × 8
## cluster avg_price avg_reviews avg_review_score bedroom_to_guest_ratio
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 87.2 25.3 93.4 0.46
## 2 2 182. 24.8 94.9 0.43
## 3 3 77.5 20.8 98.1 0.61
## # ℹ 3 more variables: pct_cleaning_fee <dbl>,
## # avg_listings_per_neighborhood <dbl>, n_neighborhoods <int>
# Show neighborhoods by cluster
cluster_composition <- neighborhood_summary_named %>%
select(neighbourhood, cluster) %>%
arrange(cluster, neighbourhood)
# View table
print(cluster_composition)
## neighbourhood cluster
## 1 Bedford Park 1
## 2 Bensonhurst 1
## 3 Brooklyn Navy Yard 1
## 4 Brownsville 1
## 5 Canarsie 1
## 6 Claremont 1
## 7 Columbia Street Waterfront 1
## 8 Concourse 1
## 9 Corona 1
## 10 Country Club 1
## 11 Ditmars / Steinway 1
## 12 Dyker Heights 1
## 13 East Elmhurst 1
## 14 East New York 1
## 15 Flatlands 1
## 16 Forest Hills 1
## 17 Gravesend 1
## 18 Jackson Heights 1
## 19 Kew Garden Hills 1
## 20 Kingsbridge 1
## 21 Kingsbridge Heights 1
## 22 Middle Village 1
## 23 Midwood 1
## 24 Morris Park 1
## 25 New Brighton 1
## 26 Randall Manor 1
## 27 Rego Park 1
## 28 Richmond Hill 1
## 29 Riverdale 1
## 30 South Beach 1
## 31 Stapleton 1
## 32 The Rockaways 1
## 33 Westchester Village 1
## 34 Woodside 1
## 35 Astoria 2
## 36 Battery Park City 2
## 37 Brooklyn Heights 2
## 38 Carroll Gardens 2
## 39 Civic Center 2
## 40 Clinton Hill 2
## 41 East Harlem 2
## 42 Hudson Square 2
## 43 Meatpacking District 2
## 44 Midtown 2
## 45 Murray Hill 2
## 46 Nolita 2
## 47 Park Slope 2
## 48 Prospect Heights 2
## 49 South Street Seaport 2
## 50 Upper West Side 2
## 51 West Village 2
## 52 Annadale 3
## 53 Co-op City 3
## 54 Marine Park 3
## 55 Mount Eden 3
## 56 Van Nest 3
In our clustering model, we incorporated a variety of engineered features describing the market characteristics of Airbnb listings in 60 randomly sampled New York City neighborhoods (56 remained after dropping missing values). These included: the average price of listings (back-transformed from the log scale), the average number of reviews as a proxy for demand, the average review score as a measure of customer satisfaction, the ratio of bedrooms to guests (a space-efficiency metric), the percentage of listings that charge a cleaning fee, and the number of listings in each neighborhood to approximate market density. We selected these variables to balance pricing, guest experience, operational practices, and market activity.
The resulting k-means model revealed three distinct clusters. Cluster 1 consists of moderately priced neighborhoods with the highest number of reviews and listings, possibly reflecting high-demand, high-traffic areas with competitive but not top-tier pricing. Cluster 2 represents high-price, low-volume neighborhoods with fewer listings, lower review counts, and the highest cleaning fee incidence, indicating likely premium or luxury areas. Cluster 3 comprises the lowest-priced neighborhoods, with moderate to low reviews and a higher bedroom-to-guest ratio, suggesting more budget-oriented accommodations with more generous space configurations. These insights offer a nuanced segmentation of NYC’s short-term rental market and provide a foundation for future pricing or operational strategies.
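Beyond the elbow plot, one way to sanity-check the choice of k = 3 is the average silhouette width. This sketch uses the `cluster` package on the standardized matrix built above; values near 1 indicate tight, well-separated clusters, while values near 0 indicate overlap.

```r
# Average silhouette width for the k = 3 solution; depends on
# neighborhood_scaled_clean and kmeans_result from the chunks above.
library(cluster)
sil <- silhouette(kmeans_result$cluster, dist(neighborhood_scaled_clean))
mean(sil[, "sil_width"])
```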
In this project, we explored Airbnb listings in NYC through multiple lenses: regression, classification, and clustering. Our goal was to understand what drives price, how to predict it, and how to group different segments of the market.
For predicting log_price, the multiple regression model performed best. With an R-squared of 0.645 and an RMSE of 0.387, it explained about 65% of the price variation using meaningful predictors such as number of bedrooms, bathrooms, review scores, and room type. This model offers strong, interpretable insights and is reliable for price estimation, especially on the continuous scale.
On the other hand, the Naive Bayes and decision tree models attempted to classify listings into categories. While helpful for broad segmentation, these models had modest accuracy (52%) and made simplifying assumptions that reduced their predictive power: Naive Bayes assumes independence among predictors, which isn’t always realistic, while the decision tree struggled with minority categories like the “moderate” cancellation policy.
The KNN model for cleaning fee classification achieved 70.74% accuracy but was heavily biased toward the majority class (“yes”). Its low Kappa score (0.17) and poor recall on “no” listings suggest it is not well suited to this kind of imbalanced classification problem.
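To make that caveat concrete: with roughly 70% of listings in the “yes” class, a classifier that always predicts the majority class already matches the reported accuracy, which is exactly what the low Kappa flags. A small self-contained illustration (the 70/30 split here is synthetic, mirroring the reported imbalance):

```r
# Synthetic labels with a ~70/30 split.
labels <- c(rep("yes", 707), rep("no", 293))
majority_pred <- rep("yes", length(labels))
mean(majority_pred == labels) # baseline accuracy of ~0.707
```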
Lastly, the K-means clustering analysis helped us uncover patterns across the 56 sampled NYC neighborhoods. It grouped areas into three meaningful clusters: luxury, high-demand, and budget-friendly markets. While this method doesn’t predict individual prices, it adds valuable market-level context and supports strategic decisions, such as identifying target neighborhoods for expansion or adjusting service offerings.
If our goal is to analyze and predict listing prices (in log scale), the multiple linear regression model is the best choice. It offers both accuracy and clarity, helping users understand what drives price variations across NYC’s Airbnb market.
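Because the regression targets log_price, dollar-scale predictions come from exponentiating the fitted value; the model and object names in the commented line are illustrative, not from the analysis above.

```r
# Back-transforming a log-scale prediction to dollars.
# pred_log <- predict(price_lm, newdata = new_listing)  # hypothetical names
pred_log <- 4.8 # an example log_price value
exp(pred_log)   # about $121.5 per night
```

Note that a naive `exp()` back-transform slightly underestimates the mean price; under normally distributed errors, a common correction multiplies by `exp(sigma^2 / 2)`, where sigma is the residual standard error.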